Joining Data in R

Using dplyr to join data tables

Objectives

This document outlines some common data joining tasks with the dplyr package. If you’ve ever Googled ‘how to perform [x] in R’, you’ve probably discovered there are multiple ways to accomplish [x]. The methods I present here are not exhaustive, and I’ve tried to balance the trade-off between 1) solutions/functions that solve problems with the fewest number of keystrokes, and 2) code that’s readable, explicit, and easy to follow.

Load packages

dplyr is part of the core tidyverse packages, so we install and load this meta-package below. We’ll also be installing the inspectdf, starwarsdb, and dm packages.

# install.packages(c("tidyverse", "inspectdf", "starwarsdb", 
#                   "dm", "patchwork", "kableExtra"))
# devtools::install_github("nsgrantham/ggdark")
library(tidyverse)
library(inspectdf)
library(starwarsdb)
library(dm)
library(patchwork)
library(kableExtra)
library(ggdark)

Namespacing

I’ll be using the namespace syntax (package::function()) just to make it explicitly clear which function belongs to which package, but this isn’t necessary if you’ve run the code block above.

Inspecting data tables

Whenever I’m performing a series of joins, I like to use the inspectdf package. This package has a series of functions for viewing all the columns in a table by various types:

There are three principles for tidy data:

  1. Columns hold variables
  2. Rows hold observations
  3. Cells (the intersection of rows and columns) hold values

Joins

Joins give us the ability to combine and filter multiple datasets on common column(s).

We’re going to be using the starwarsdb package, which dplyr::contains data from the Star Wars API.

Using inspectdf

We’re also going to use the inspectdf package to help understand and verify the contents of the joined tables. The inspectdf package will create plots that summarise missingness, categorical levels, numeric distribution, correlation, column types and memory usage.

To use the inspectdf functions,

  1. Pass your dataset to the visualization function (all start with inspect_)

  2. Then pass the output to the inspectdf::show_plot()

See the example below with inspectdf::inspect_na():

data |> 
  inspectdf::inspect_na() |> 
  inspectdf::show_plot()

We also want to include the text_labels = TRUE argument in the inspectdf::show_plot() function:

data |> 
  inspectdf::inspect_na() |> 
  inspectdf::show_plot(text_labels = TRUE)

The starwarsdb Data Model

The image below displays the data model for the tables in this package

sw_data_model <- dm::dm_draw(dm = starwarsdb::starwars_dm(),
  graph_name = "sw_data_model",
  view_type = "all")
sw_data_model

As you can see from the graph above, there are nine tables in the starwarsdb. The tables connect to each other through a series of common columns (called keys), and these allow us to perform multiple kinds of joins.

Create sw_films

Load the starwarsdb::films dataset, removing the opening_crawl, director, and producer columns, and filter to only episodes 4-6. Store these data in sw_flms

sw_flms <- starwarsdb::films |> 
  dplyr::select(-c(opening_crawl, director, producer)) |> 
  dplyr::filter(episode_id >= 4 & episode_id <= 6)
sw_flms |> dplyr::glimpse()
Rows: 3
Columns: 3
$ title        <chr> "A New Hope", "The Empire Strikes Back", "Retur…
$ episode_id   <int> 4, 5, 6
$ release_date <date> 1977-05-25, 1980-05-17, 1983-05-25

left_join

A left_join keeps all of x, and joins it to all matching rows from dataset y

We want to left-join sw_flms to starwarsdb::films_vehicles:

starwarsdb::films_vehicles |> dplyr::glimpse()
Rows: 104
Columns: 2
$ title   <chr> "A New Hope", "A New Hope", "A New Hope", "A New Hop…
$ vehicle <chr> "CR90 corvette", "Star Destroyer", "Sentinel-class l…

We can do this on "title", then we can re-organize the columns using dplyr::select() to place the vehicle column before dplyr::everything() else.

First we’ll take a look at the join before assigning to a new table:

dplyr::left_join(x = sw_flms, 
          y = starwarsdb::films_vehicles, 
          by = "title")  |>   
  dplyr::select(vehicle, 
    dplyr::everything()) |> 
  dplyr::glimpse()
Rows: 47
Columns: 4
$ vehicle      <chr> "CR90 corvette", "Star Destroyer", "Sentinel-cl…
$ title        <chr> "A New Hope", "A New Hope", "A New Hope", "A Ne…
$ episode_id   <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5,…
$ release_date <date> 1977-05-25, 1977-05-25, 1977-05-25, 1977-05-25…

The glimpse() output indicates the left_join() appears to be working, but we’ll pass the output to the inspectdf::inspect_na() function to verify:

dplyr::left_join(x = sw_flms, 
          y = starwarsdb::films_vehicles, 
          by = "title")  |>   
  dplyr::select(vehicle, 
    dplyr::everything()) |> 
  inspectdf::inspect_na() |> 
  inspectdf::show_plot(text_labels = TRUE)

The inspectdf::inspect_na() function is a great tool for checking intermediate table results because it will show us if the result is going to have more missing values than expected.

The graph tells us there are no missing values in the left_join between sw_flms and starwarsdb::films_vehicles, so we’ll assign to sw_flms_veh.

dplyr::left_join(x = sw_flms, 
          y = starwarsdb::films_vehicles, 
          by = "title")  |>   
  dplyr::select(vehicle, 
    dplyr::everything()) -> sw_flms_veh

Let’s take a look at the sw_flms_veh table with inspectdf::inspect_types()

inspectdf::inspect_types()

The code below passes the new sw_flms_veh table to inspectdf::inspect_types().

sw_flms_veh |> 
  inspectdf::inspect_types() |> 
  inspectdf::show_plot(
    text_labels = TRUE)

We can see the 4 columns in this table (and the breakdown of each columns by type). Two of these columns are character, so we’ll use inspectdf::inspect_cat() to display the categorical variables:

inspectdf::inspect_cat()

The code below passes the new sw_flms_veh table to inspectdf::inspect_cat().

sw_flms_veh |> 
  inspectdf::inspect_cat() |> 
  inspectdf::show_plot(
    text_labels = TRUE)

This tells us the release_date column has been formatted as a character, but none are missing. We’ll visualize the final episode_id column using inspectdf::inspect_num()

inspectdf::inspect_num()

The code below passes the new sw_flms_veh table to inspectdf::inspect_num().

sw_flms_veh |> 
  inspectdf::inspect_num() |> 
  inspectdf::show_plot(
    text_labels = TRUE) 

This graph is a little strange to interpret, but it’s essentially a rotated histogram, which we know how to create:

sw_flms_veh |> 
  ggplot2::ggplot(aes(x = episode_id)) + 
  ggplot2::geom_histogram() + 
  ggplot2::labs(
    title = "Histogram of episode_id in sw_flms_veh", 
    subtitle = "Verifying left-join between starwarsdb::films_vehicles and starwarsdb::films")

Below is a comparison of the two graphs–as you can see, inspectdf::inspect_num() uses Probability on the y axis, while the ggplot2::geom_histogram() uses the count on the y axis.

Now we have a process for verifying new joins:

  1. Pass the output to inspectdf::inspect_na() BEFORE assigning to new object

  2. Pass the new table to inspectdf::inspect_types()

  3. Pass the new table to inspectdf::inspect_cat()

  4. Pass the new table to inspectdf::inspect_num()

inner_join

The inner_join() joins dataset x and y, and keeps only matching rows from both.

inspectdf::inspect_na()

The code below uses an inner_join to join sw_films_veh to starwarsdb::pilots on "vehicle", and passes the output to inspectdf::inspect_na():

dplyr::inner_join(x = sw_flms_veh, 
           y = starwarsdb::pilots, 
           by = "vehicle") |> 
  inspectdf::inspect_na() |> 
  inspectdf::show_plot(text_labels = TRUE)

We can see there are no missing values, so we will re-organize these columns and assign the output to sw_flms_veh_plt, then pass it to inspectdf::inspect_types().

inspectdf::inspect_types()

To distinguish these plots from the ones above, we’ll adjust the col_palette in the inspectdf::show_plot() function:

sw_flms_veh_plt <- dplyr::inner_join(x = sw_flms_veh, 
           y = starwarsdb::pilots, 
           by = "vehicle") |> 
  dplyr::select(pilot, vehicle, title, episode_id, release_date)
sw_flms_veh_plt |> 
  inspectdf::inspect_types() |> 
  inspectdf::show_plot(
    text_labels = TRUE, col_palette = 1)

We can see these variables are all formatted correctly.

inspectdf::inspect_cat()

Next we’ll pass the sw_flms_veh_plt to the inspectdf::inspect_cat() function:

sw_flms_veh_plt |> 
  inspectdf::inspect_cat() |> 
  inspectdf::show_plot(
    text_labels = TRUE, col_palette = 1)

This shows our output table contains the new column from the starwarsdb::pilots ( and no missing values!).

We now have a table with pilot, vehicle, title, episode_id and release_date:

sw_flms_veh_plt |> glimpse()
Rows: 39
Columns: 5
$ pilot        <chr> "Chewbacca", "Han Solo", "Lando Calrissian", "N…
$ vehicle      <chr> "Millennium Falcon", "Millennium Falcon", "Mill…
$ title        <chr> "A New Hope", "A New Hope", "A New Hope", "A Ne…
$ episode_id   <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5,…
$ release_date <date> 1977-05-25, 1977-05-25, 1977-05-25, 1977-05-25…

And these are all from the first three films, so we can do things like get a total number of vehicles per film (using tidyr and dplyr):

sw_flms_veh_plt |> 
  # count the two columns
  dplyr::count(title, vehicle) |> 
  # move rows to columns
  tidyr::pivot_wider(names_from = title, values_from = n) |> 
  # select columns in order
  dplyr::select(vehicle, 
    contains("Hope"),
    contains("Empire"),
    contains("Return")
    # prettier table
    ) |> knitr::kable() 
vehicle A New Hope The Empire Strikes Back Return of the Jedi
Millennium Falcon 4 4 4
TIE Advanced x1 1 NA NA
X-wing 4 4 4
A-wing NA NA 1
AT-ST NA 1 1
Imperial shuttle NA 3 3
Imperial Speeder Bike NA NA 2
Slave 1 NA 1 NA
Snowspeeder NA 2 NA

Now we’re going to get some additional information on our pilots.

right_join

The right_join() includes all the rows in dataset y, and joins it the matching rows in dataset x

inspectdf::inspect_na()

The code below uses a dplyr::right_join() to join starwarsdb::species to starwarsdb::people on "title", keeping all the rows in starwarsdb::people. But we need to make some changes to the columns first,

  1. We need to remove the duplicate homeworld column
  1. We want to rename the duplicate name column
  1. Because the by columns have different names, we’ll have to adjust the by argument

inspectdf::inspect_na()

We’ll pass the output to inspectdf::inspect_na()

starwarsdb::species |> 
  dplyr::select(-homeworld) |> 
  dplyr::rename(species_name = name) |> 
  dplyr::right_join(
    y = starwarsdb::people,
    by = c("species_name" = "species")
    ) |> 
  inspectdf::inspect_na() |> 
  inspectdf::show_plot(text_labels = TRUE, col_palette = 2)

This is showing quite a few missing values in the birth_year column.

However, we’re only interested in the species_name, classification, and designation columns from starwarsdb::species, and the sex, name, mass, height, and gender columns from starwarsdb::people, so we’ll use dplyr::select() to reduce this output, the use dplyr::inner_join() to join this to sw_flms_veh_plt (using by = c("name" = "pilot")):

starwarsdb::species |> 
  dplyr::select(-homeworld) |> 
  dplyr::rename(species_name = name) |> 
  dplyr::right_join(
    y = starwarsdb::people,
    by = c("species_name" = "species")
    ) |> 
  # reduce to only the columns we're interested in 
  dplyr::select(sex, name, mass, height, gender, 
                species_name, designation, classification) |> 
  # join to our pilots table
  dplyr::inner_join(y = sw_flms_veh_plt, 
    by = c("name" = "pilot")) |> 
  inspectdf::inspect_na() |> 
  inspectdf::show_plot(text_labels = TRUE, col_palette = 2)

And we can see this reduces the number of missing values so we’ll assign the output to sw_pilots_4thru6

sw_pilots_4thru6 <- starwarsdb::species |> 
  dplyr::select(-homeworld) |> 
  dplyr::rename(species_name = name) |> 
  dplyr::right_join(
    y = starwarsdb::people,
    by = c("species_name" = "species")
    ) |> 
  # reduce to only the columns we're interested in 
  dplyr::select(sex, name, mass, height, gender, 
                species_name, designation, classification) |> 
  # join to our pilots table
  dplyr::inner_join(y = sw_flms_veh_plt, 
    by = c("name" = "pilot")) 
dplyr::glimpse(sw_pilots_4thru6)
Rows: 39
Columns: 12
$ sex            <chr> "male", "male", "male", "male", "male", "male…
$ name           <chr> "Luke Skywalker", "Luke Skywalker", "Luke Sky…
$ mass           <dbl> 77, 77, 77, 77, 77, 77, 77, 136, 49, 84, 84, …
$ height         <dbl> 172, 172, 172, 172, 172, 172, 172, 202, 150, …
$ gender         <chr> "masculine", "masculine", "masculine", "mascu…
$ species_name   <chr> "Human", "Human", "Human", "Human", "Human", …
$ designation    <chr> "sentient", "sentient", "sentient", "sentient…
$ classification <chr> "mammal", "mammal", "mammal", "mammal", "mamm…
$ vehicle        <chr> "X-wing", "X-wing", "Imperial shuttle", "Snow…
$ title          <chr> "A New Hope", "The Empire Strikes Back", "The…
$ episode_id     <int> 4, 5, 5, 5, 6, 6, 6, 4, 6, 4, 5, 6, 4, 5, 5, …
$ release_date   <date> 1977-05-25, 1980-05-17, 1980-05-17, 1980-05-…

inspectdf::inspect_types()

Below is the breakdown of the new sw_pilots_4thru6 columns by type.

sw_pilots_4thru6 |>
  inspectdf::inspect_types() |>
  inspectdf::show_plot(
    text_labels = TRUE, col_palette = 2)

inspectdf::inspect_cat()

Below is the breakdown of the categorical columns in sw_pilots_4thru6:

sw_pilots_4thru6 |> 
  inspectdf::inspect_cat() |> 
  inspectdf::show_plot(text_labels = TRUE, 
    col_palette = 2)

As we can see, the inspectdf::inspect_cat() function displays the missing values in categorical columns as gray areas.

inspectdf::inspect_num()

Below is are the histograms of the numerical column in sw_pilots_4thru6:

sw_pilots_4thru6 |> 
  inspectdf::inspect_num() |> 
  inspectdf::show_plot(text_labels = TRUE, 
    col_palette = 2)

Now we can answer questions like, “what species were pilots in each film?”

sw_pilots_4thru6 |> 
  count(title, species_name) |> 
  ggplot2::ggplot(aes(
    x = title,
    y = n, )) + 
  ggplot2::geom_col(aes(fill = species_name), 
    position = "dodge2", width = 0.8) + 
  labs(title = "Starwars Pilots (Episodes 4-6)", 
    subtitle = "Breakdown by title and species", 
    x = "Count", 
    y = "Vehicle", 
    ) + 
  ggdark::dark_theme_bw()

anti_join

The anti_join() keeps all rows in dataset x that do not have a match in dataset y.

inspectdf::inspect_na()

Anti-joins are great for filtering out observations that don’t exist between two tables. The last join created empty values for vehicle, episode_id, release_date, and pilot, which we can verify with inspectdf::inspect_na() (*change the col_palette to 3):

inspectdf::inspect_types()

inspectdf::inspect_cat()

inspectdf::inspect_num()

These should only contain the titles not in sw_films.